Many people are aware of Presidential elections and the so-called “off-cycle” or midterm elections, but many don’t realize that Utah holds elections every year. Identifying likely voters in Presidential election years has long been fairly simple. It is somewhat harder to identify midterm voters, but growing campaign budgets have allowed even some down-ballot office-seekers to run rudimentary data mining operations. In odd-numbered years, cities hold non-partisan municipal elections. These alternate every two years: in one cycle a city elects its mayor and some councilmembers, and in the next it elects the remaining members of the city council. The mayoral election, an “off-off cycle” race, generally has the higher turnout of the two; the other, an “off-off-off cycle” race, is lucky if anyone even knows it’s happening. 2019 was a non-mayoral city council race in Saratoga Springs.
Local elections are often plagued by a lack of interest or awareness. They also have minimal funds devoted to them and are often run without staff, primarily by the candidate and volunteers who are close friends. Saratoga Springs is a city of roughly 35,000 residents, approximately 13,000 of whom are registered to vote. Municipal elections typically see 15-25% turnout, meaning that roughly 3,000 people can be expected to vote. The difference between sending a mailer to 3,000 voters and sending it to 13,000 can amount to several thousand dollars. Since most candidates in Saratoga Springs have less than $3,000 to spend, effective and efficient use of money is paramount.
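The budget argument can be made concrete with a few lines of arithmetic. The sketch below uses the registration and turnout figures from the text; the cost per mailer (`cost_per_piece`) is a purely hypothetical value chosen for illustration, not a figure from any campaign.

```r
# Registered voters and typical municipal turnout range, taken from the text.
registered <- 13000
turnout <- c(low = 0.15, high = 0.25)

# Expected number of voters at each end of the turnout range.
expected_voters <- registered * turnout

# ASSUMPTION: hypothetical all-in cost per mailer (printing plus postage).
cost_per_piece <- 0.50

# Cost of mailing every registered voter versus only the expected voters.
cost_all <- registered * cost_per_piece
cost_targeted <- expected_voters * cost_per_piece
savings <- cost_all - cost_targeted

print(expected_voters)
print(savings)
```

Under this assumed price, mailing only the expected voters would save several thousand dollars per mailer, which is more than the entire budget of most candidates described above.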
The goal is therefore to accurately identify registered voters who are most likely to participate in the 2019 Saratoga Springs municipal election and to create a model that will allow prediction of future voter turnout as well.
The available data are public voter records obtained from the Utah County Clerk. There are 11,330 observations with 136 attributes, including some limited demographic information as well as voting history (whether they voted, not for whom). As noted above, there are more than 13,000 registered voters, so roughly 2,000 voters are missing from these records because they have opted to keep their information private. This poses an additional challenge to comprehensive identification. The data was anonymized and, in some instances, transformed to facilitate modelling. In addition, addresses were geocoded to latitude/longitude coordinates using the Google Maps API.
The data was checked for erroneous values and outliers; a few instances were identified and excluded from the data set. Several attributes were discarded because they are identical for all observations, a consequence of the limited geographical area. One of the provided demographics is age. When plotted, age showed a fairly normal distribution with a slight right skew. The age distribution was also examined to see whether it differed markedly by voting precinct. Apart from a slightly higher age distribution in precinct SR07, there was no significant difference in the age make-up of the precincts.
Because most of the attributes describe voting history, each election was transformed into a binary Voted: True/False attribute, and apriori association-rule analysis was conducted using the 2017 Primary and 2017 General Elections as the outcomes (right-hand sides) for rule-making. This resulted in 202 association rules for Primary 2017 and 65 rules for General 2017.
```r
library(arules)  # provides apriori()

# Rules for the General election: only the 2017 General vote may be the rhs
rules.G2017 <- apriori(voter.apriori[, 1:16],
                       parameter = list(supp = 0.05, conf = 0.5),
                       appearance = list(default = "lhs", rhs = "Voted_X11.7.2017"),
                       control = list(verbose = FALSE))

# Examine General Election rules
rules.G2017
## set of 65 rules
```
Ten rules, ordered by decreasing confidence:

```
##      lhs                   rhs                   support confidence     lift count
## [1]  {Voted_X11.3.2015,
##       Voted_X11.8.2016,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05354303  0.8393352 4.008268   606
## [2]  {Voted_X11.3.2015,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05531013  0.8357810 3.991295   626
## [3]  {Voted_X11.6.2012,
##       Voted_X11.5.2013,
##       Voted_X11.8.2016,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05257113  0.8117326 3.876451   595
## [4]  {Voted_X11.6.2012,
##       Voted_X11.5.2013,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05478000  0.8115183 3.875428   620
## [5]  {Voted_X11.5.2013,
##       Voted_X11.8.2016,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05575190  0.8100128 3.868239   631
## [6]  {Voted_X11.2.2010,
##       Voted_X11.4.2014,
##       Voted_X11.8.2016,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05036225  0.8096591 3.866549   570
## [7]  {Voted_X11.5.2013,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05813748  0.8073620 3.855579   658
## [8]  {Voted_X11.2.2010,
##       Voted_X11.6.2012,
##       Voted_X11.4.2014,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05045061  0.8064972 3.851449   571
## [9]  {Voted_X11.2.2010,
##       Voted_X11.4.2014,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05195264  0.8054795 3.846589   588
## [10] {Voted_X11.6.2012,
##       Voted_X11.4.2014,
##       Voted_X11.8.2016,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.06370383  0.7949283 3.796202   721
```
Ten rules, ordered by decreasing support:

```
##      lhs                   rhs                   support confidence     lift count
## [1]  {Voted_X8.15.2017} => {Voted_X11.7.2017} 0.11168051  0.7247706 3.461162  1264
## [2]  {Voted_X11.8.2016,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.10293338  0.7410941 3.539115  1165
## [3]  {Voted_X11.5.2013,
##       Voted_X11.8.2016} => {Voted_X11.7.2017} 0.09135890  0.5146839 2.457887  1034
## [4]  {Voted_X11.6.2012,
##       Voted_X11.5.2013} => {Voted_X11.7.2017} 0.08976851  0.5029703 2.401948  1016
## [5]  {Voted_X11.3.2015} => {Voted_X11.7.2017} 0.08711787  0.5615034 2.681475   986
## [6]  {Voted_X11.6.2012,
##       Voted_X11.5.2013,
##       Voted_X11.8.2016} => {Voted_X11.7.2017} 0.08623432  0.5219251 2.492468   976
## [7]  {Voted_X11.6.2012,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.08526241  0.7533177 3.597489   965
## [8]  {Voted_X11.3.2015,
##       Voted_X11.8.2016} => {Voted_X11.7.2017} 0.08376038  0.5690276 2.717407   948
## [9]  {Voted_X11.6.2012,
##       Voted_X11.8.2016,
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.08172822  0.7594417 3.626735   925
## [10] {Voted_X11.5.2013,
##       Voted_X11.4.2014} => {Voted_X11.7.2017} 0.07819403  0.5473098 2.613693   885
```
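For readers unfamiliar with the support, confidence, lift and count columns in the rule output above, the following sketch computes the same four measures by hand for a single rule. It uses a tiny made-up voting history in base R; the column names `primary` and `general` and the helper `rule_metrics()` are hypothetical and are not part of the original analysis.

```r
# Toy voting history: one row per voter, TRUE if they voted in that election.
# ASSUMPTION: made-up data and column names, for illustration only.
votes <- data.frame(
  primary = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  general = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Measures for the rule {lhs} => {rhs}, given two logical vectors.
rule_metrics <- function(lhs, rhs) {
  n <- length(lhs)
  count <- sum(lhs & rhs)            # voters satisfying both sides
  support <- count / n               # share of all voters satisfying both
  confidence <- count / sum(lhs)     # share of lhs voters who also satisfy rhs
  lift <- confidence / mean(rhs)     # confidence relative to the rhs base rate
  c(support = support, confidence = confidence, lift = lift, count = count)
}

m <- rule_metrics(votes$primary, votes$general)
print(m)
```

A lift above 1 means the left-hand side makes the outcome more likely than its base rate. The same definitions also allow a quick sanity check of the real output: since support is count divided by the number of records, a rule with count 1264 and support 0.11168051 implies roughly 11,318 records went into the rule mining.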
The rules were reviewed for commonalities to determine which elections in the voting history were most relevant to later outcomes.
| Most Common Elections from Apriori Rules |
|---|
| 8/15/2017 |
| 11/8/2016 |
| 11/3/2015 |
| 11/4/2014 |
| 11/5/2013 |
| 11/6/2012 |
| 11/2/2010 |
The first attempt (Option 1) looked for groupings of voters based on their voting history and age. The within-cluster sum of squares was used to determine the optimal k for k-means clustering.
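The within-sum-of-squares approach is often called the elbow method. A minimal sketch in base R is shown below, assuming synthetic two-dimensional data with three well-separated groups rather than the voter data; the names `toy`, `ks` and `wss` are illustrative.

```r
set.seed(42)

# ASSUMPTION: synthetic data with three well-separated groups of 50 points.
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for each candidate k.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is the k after which wss stops dropping sharply.
print(data.frame(k = ks, wss = round(wss, 1)))
```

Plotting the result with `plot(ks, wss, type = "b")` makes the elbow visible. On real data such as voting histories the bend is usually less pronounced than in this toy case, so the choice of k remains a judgment call.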
It was determined to use k=3.
| clu.voters1.size | Age | Voted_X6.22.2010 | Voted_X11.2.2010 | Voted_X9.13.2011 | Voted_X11.8.2011 | Voted_X6.26.2012 | Voted_X11.6.2012 | Voted_X8.13.2013 | Voted_X11.5.2013 | Voted_X6.24.2014 | Voted_X11.4.2014 | Voted_X8.11.2015 | Voted_X11.3.2015 | Voted_X6.28.2016 | Voted_X11.8.2016 | Voted_X8.15.2017 | Voted_X11.7.2017 | Voted_X6.26.2018 | Voted_X11.6.2018 | Voted_X11.5.2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1983 | 66.21432 | 0.2072617 | 0.4911750 | 0.1043873 | 0.1820474 | 0.2702975 | 0.6807867 | 0.1412002 | 0.3706505 | 0.1018659 | 0.5385779 | 0.1674231 | 0.2955119 | 0.2455875 | 0.8023197 | 0.3469491 | 0.4412506 | 0.4977307 | 0.8340898 | 0.4583964 |
| 4064 | 27.92618 | 0.0164862 | 0.0725886 | 0.0083661 | 0.0191929 | 0.0295276 | 0.3228346 | 0.0145177 | 0.0494587 | 0.0073819 | 0.1026083 | 0.0152559 | 0.0509350 | 0.0406004 | 0.5720965 | 0.0760335 | 0.1013780 | 0.1023622 | 0.5243602 | 0.1395177 |
| 5271 | 42.89148 | 0.0939101 | 0.3204326 | 0.0861317 | 0.1646746 | 0.1185733 | 0.6619237 | 0.0785430 | 0.2375261 | 0.0292165 | 0.3614115 | 0.0840448 | 0.1826978 | 0.0855625 | 0.7981408 | 0.1417188 | 0.2054639 | 0.2054639 | 0.6808955 | 0.2822994 |
This clustering was inconclusive, as the cluster centroids weren’t clearly differentiated. The next attempt clustered using only the elections identified by the apriori analysis discussed previously.
It was determined to use k = 4.
| clu.voters2.size | Age | Voted_X11.2.2010 | Voted_X11.6.2012 | Voted_X11.5.2013 | Voted_X11.4.2014 | Voted_X11.3.2015 | Voted_X11.8.2016 | Voted_X8.15.2017 |
|---|---|---|---|---|---|---|---|---|
| 2751 | 1.695061 | 0.7728099 | 0.9701927 | 0.6586696 | 0.9160305 | 0.5038168 | 0.9291167 | 0.3678662 |
| 2998 | 1.515949 | 0.0490327 | 0.1914610 | 0.0183456 | 0.0456971 | 0.0136758 | 0.0000000 | 0.0436958 |
| 2661 | 1.563180 | 0.0477264 | 0.0000000 | 0.0289365 | 0.0950770 | 0.0571214 | 1.0000000 | 0.1180008 |
| 2908 | 1.605510 | 0.1918845 | 1.0000000 | 0.0839065 | 0.1650619 | 0.0608666 | 0.9993122 | 0.0986933 |
This resulted in more defined clusters: one that clearly represented likely voters and one that represented rare voters. The other two clusters didn’t clearly favor either group and weren’t readily distinguishable from each other. It was determined to move forward with only the “Likely Voter” cluster.
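Picking out the “Likely Voter” cluster can also be done programmatically by choosing the cluster whose centroid has the highest average participation. The sketch below illustrates the idea on simulated binary voting histories with two clusters; the data, the column names and the choice of k = 2 are assumptions for illustration and differ from the k = 4 clustering above.

```r
set.seed(1)

# ASSUMPTION: simulated histories for 60 frequent and 140 rare voters
# across four elections (1 = voted, 0 = did not vote).
frequent <- matrix(rbinom(60 * 4, 1, 0.9), ncol = 4)
rare <- matrix(rbinom(140 * 4, 1, 0.1), ncol = 4)
history <- rbind(frequent, rare)
colnames(history) <- paste0("election_", 1:4)

fit <- kmeans(history, centers = 2, nstart = 25)

# Each centroid value is the share of the cluster that voted in that election,
# so the row mean is the cluster's average participation.
participation <- rowMeans(fit$centers)
likely_cluster <- which.max(participation)

# Row indices of the voters assigned to the most participatory cluster.
likely_rows <- which(fit$cluster == likely_cluster)
print(round(participation, 2))
print(length(likely_rows))
```

Because the inputs are 0/1 indicators, the centroid table doubles as a turnout table per cluster, which is how the centroid values in the tables above can be read.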
This “Likely Voters” Cluster was plotted on a map of Saratoga Springs to see if there were any identifiable patterns.
This plot was then compared with a plot of Actual Voters that was obtained from the data set.
Visually, the two plots have significant similarities, but differences are also apparent: the Actual Voter plot shows a different distribution as well as more voters overall in some areas of the city.
```r
# Number of Likely Voters
nrow(likely.voters)
## [1] 2751

# Number of Actual Voters
nrow(actual.voters)
## [1] 2964
```
Comparing these numbers, clustering appears to do a fairly accurate job of identifying likely voters. However, looking a little deeper at the number of voters who appear in both the Likely Voter subset and the Actual Voter subset is illuminating.
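```r
library(dplyr)  # provides semi_join()

# Actual voters who also appear in the Likely Voter cluster
correct.kmeans <- semi_join(actual.voters, likely.voters, by = "Voter.ID")
nrow(correct.kmeans)
## [1] 1437
```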
Only 1,437 voters were correctly identified, which is just 48% of the 2,964 actual voters.
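The 48% figure is the share of actual voters that the cluster captured, which is usually called recall. A short calculation from the three counts reported above also gives the complementary measure, precision (the share of predicted voters who actually voted). The variable names are illustrative.

```r
# Counts reported in the analysis above.
likely <- 2751    # voters in the Likely Voter cluster
actual <- 2964    # voters who actually voted
overlap <- 1437   # voters present in both sets

recall <- overlap / actual      # share of actual voters that were predicted
precision <- overlap / likely   # share of predicted voters who actually voted
f1 <- 2 * precision * recall / (precision + recall)

print(round(c(recall = recall, precision = precision, f1 = f1), 3))
# recall 0.485, precision 0.522, f1 0.503
```

For a mailing campaign, precision reflects money wasted on people who do not vote, while recall reflects voters who are never contacted, so both are worth tracking.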
In the second plot, of Actual Voters, the areas of higher voter density seemed to be located close to the homes of candidates.
To determine the influence that distance from a candidate might have on clustering, the distance (m) between each voter’s home and that of the nearest candidate was calculated and added to the data as a new variable. This distance was then log10 transformed to standardize its effect on clustering. The resulting data set was then analyzed to determine the optimal k value.
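One way such a distance feature could be computed from geocoded coordinates is sketched below. It uses the haversine great-circle formula in base R, which is an assumption: the original analysis does not say how distances were calculated. The coordinates and the helper `haversine_m()` are hypothetical; only the column name `min.c.dist` follows the table below.

```r
# Great-circle distance in metres between points given in decimal degrees.
# ASSUMPTION: haversine formula with a mean Earth radius of 6,371 km.
haversine_m <- function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# ASSUMPTION: made-up coordinates, not real voter or candidate addresses.
voters <- data.frame(lat = c(40.350, 40.380, 40.330),
                     lon = c(-111.900, -111.920, -111.880))
candidates <- data.frame(lat = c(40.350, 40.335),
                         lon = c(-111.905, -111.880))

# Distance from every voter (rows) to every candidate (columns).
dist_matrix <- outer(
  seq_len(nrow(voters)), seq_len(nrow(candidates)),
  function(i, j) haversine_m(voters$lat[i], voters$lon[i],
                             candidates$lat[j], candidates$lon[j])
)

# Distance to the nearest candidate, log10 transformed.
voters$min.c.dist <- log10(apply(dist_matrix, 1, min))
print(voters)
```

A voter living at exactly the same coordinates as a candidate would have a distance of zero, and `log10(0)` is `-Inf`, so real data would need a small offset or a floor before the transform.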
It was determined to use k = 4.
| clu.voters3.size | Age | Voted_X11.2.2010 | Voted_X11.6.2012 | Voted_X11.5.2013 | Voted_X11.4.2014 | Voted_X11.3.2015 | Voted_X11.8.2016 | Voted_X8.15.2017 | min.c.dist |
|---|---|---|---|---|---|---|---|---|---|
| 2193 | 1.561480 | 0.0469676 | 0.0000000 | 0.0287278 | 0.0939352 | 0.0565435 | 1.0000000 | 0.1158231 | 2.955462 |
| 2250 | 1.693608 | 0.7711111 | 0.9697778 | 0.6608889 | 0.9133333 | 0.5057778 | 0.9315556 | 0.3662222 | 2.933730 |
| 2484 | 1.517365 | 0.0438808 | 0.1916264 | 0.0181159 | 0.0454911 | 0.0136876 | 0.0000000 | 0.0390499 | 2.956943 |
| 2413 | 1.606829 | 0.1989225 | 1.0000000 | 0.0841276 | 0.1595524 | 0.0646498 | 0.9975135 | 0.1002901 | 2.940828 |
```r
nrow(actual.voters)
## [1] 2964

# Actual voters who also appear in the distance-aware Likely Voter cluster
correct.kmeans.dist <- semi_join(actual.voters, likely.voters.dist, by = "Voter.ID")
nrow(correct.kmeans.dist)
## [1] 1167
```
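Including the distance from candidates actually decreases the accuracy: only 1,167 of the 2,964 actual voters (about 39%) are captured, down from 1,437 (48%). Note, however, that the cluster sizes in this run sum to 9,340 records rather than the 11,318 of the previous run, so the two overlap counts are not drawn from identical sets of voters.
While the model has some value as a means of eliminating likely non-voters, it is not yet robust enough to correctly identify which voters are likely to participate in municipal elections.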